Concurrent Execution of Critical Sections by Eliding Ownership of Locks

ABSTRACT

Critical sections of multi-threaded programs, normally protected by locks providing access by only one thread, are speculatively executed concurrently by multiple threads with elision of the lock acquisition and release. Upon completion of the speculative execution without actual conflict, as may be identified using standard cache protocols, the speculative execution is committed; otherwise the speculative execution is squashed. Speculative execution with elision of the lock acquisition allows a greater degree of parallel execution in multi-threaded programs with aggressive lock usage.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 14/028,609 filed Sep. 17, 2013, which is a continuation of U.S. application Ser. No. 13/685,339 filed Nov. 26, 2012, now abandoned, which is a continuation application of Ser. No. 13/113,432 filed May 23, 2011, now abandoned, which is a continuation of U.S. application Ser. No. 12/843,828 filed Jul. 26, 2010, now issued as U.S. Pat. No. 7,962,699, which is a continuation of U.S. application Ser. No. 11/539,731 filed Oct. 9, 2006, now issued as U.S. Pat. No. 7,765,364, which is a continuation of U.S. application Ser. No. 10/037,041 filed Oct. 19, 2001, now issued as U.S. Pat. No. 7,120,762, all hereby incorporated by reference.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with government support under 9810114 awarded by the National Science Foundation. The government has certain rights in the invention.

BACKGROUND OF THE INVENTION

The present invention relates to computers with shared-memory architectures and, in particular, to architectures providing a lock mechanism preventing conflicts when multiple program threads execute a common, critical program section.

Multi-threaded software provides multiple execution “threads” which act like independently executing programs. An advantage to such multi-threaded software is that each thread can be assigned to an independent processor, or to a single processor that provides multi-threaded execution so that the threads may be executed in parallel for improved speed of execution. For example, a computer server for the Internet may use a multi-threaded server program where each separate client transaction runs as a separate thread.

Each of the threads may need to modify common data shared among the threads. For example, in the implementation of a transaction-based airline reservation system, multiple threads handling reservations for different customers may read and write common data indicating the number of seats available. If the threads are not coordinated in their use of the common data, serious error can occur. For example, a first thread may read a variable indicating an airline seat is available and then set that variable indicating that the seat has been reserved by the thread's client. If a second thread reads the same variable prior to its setting by the first thread, the second thread may, based on that read, erroneously set that variable again with the result that the seat is double booked.

To avoid these problems, it is common to use synchronizing instructions for portions of a thread (often called critical sections) where simultaneous execution by more than one thread would be a problem. A common set of synchronizing instructions implements a lock, using a lock variable having one value indicating that it is owned by a thread and another value indicating that it is available. A thread must acquire the lock before executing the critical section and does so by reading the lock variable and, if it is not held, writing a value to it indicating that it is held. When the critical section is complete, the thread again writes to the lock variable a value indicating that the lock is available again.
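The acquire, execute, release sequence just described can be sketched in software. The following is a minimal illustration, assuming Python threads stand in for program threads; the names `LockVariable`, `SeatCounter`, and `book_concurrently` are hypothetical, and an internal `threading.Lock` merely models the hardware guarantee that the read and the write of the lock variable occur as one indivisible step.

```python
import threading
import time

AVAILABLE, HELD = 0, 1


class LockVariable:
    """Software model of a lock variable with an atomic test-and-set."""

    def __init__(self):
        self.value = AVAILABLE
        # Stands in for hardware atomicity of the read-then-write step.
        self._atomic = threading.Lock()

    def try_acquire(self):
        with self._atomic:
            if self.value == AVAILABLE:
                self.value = HELD
                return True
            return False

    def release(self):
        self.value = AVAILABLE


class SeatCounter:
    """Shared data (seats available) protected by one lock variable."""

    def __init__(self, seats):
        self.seats_available = seats
        self.lock = LockVariable()

    def reserve(self):
        while not self.lock.try_acquire():
            time.sleep(0)  # spin, yielding to other threads
        try:
            # Critical section: read the count, then write it back.
            if self.seats_available > 0:
                self.seats_available -= 1
                return True
            return False
        finally:
            self.lock.release()


def book_concurrently(seats, n_threads, attempts_per_thread):
    """Run several booking threads; return (seats left, seats booked)."""
    counter = SeatCounter(seats)
    booked = [0] * n_threads  # one slot per thread, so no sharing

    def worker(i):
        for _ in range(attempts_per_thread):
            if counter.reserve():
                booked[i] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.seats_available, sum(booked)
```

Because every reservation passes through the lock, the number of seats booked cannot exceed the number that were available, which is the double-booking guarantee the lock exists to provide.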

Typically, the instructions used to acquire the lock are “atomic instructions”, that is, instructions that cannot be interrupted once begun by any other thread, or quasi-atomic instructions that can be interrupted by another thread, but that make such interruption evident to the interrupted thread so that the instructions can be repeated.

While the mechanism of locking a critical section for use by a single thread effectively solves conflict problems, it can reduce the benefits of parallel execution of threads by effectively serializing the threads as they wait for a lock. This serialization can be reduced by using a number of different locks associated, for example, with different small portions of shared memory. In this way, the chance of different threads waiting for a lock on a given portion of shared memory is reduced.

Multiple locks increase the complexity of the programming process and thus create a tradeoff between program performance and program development time.

Ideally, a software tool might be created that could review and correct for overly aggressive use of lock variables by reviewing critical sections in all threads and determining whether a more narrowly defined locking might be employed. The capability of any such software tool, however, is limited to static analysis of the software and cannot detect locking that is unnecessary during dynamic execution of the software.

SUMMARY OF THE INVENTION

A key insight to the present invention is that it may be possible to execute a critical program section correctly without acquisition of the lock. In many situations a critical section may be executed by multiple threads simultaneously with no actual conflict. This can be for a number of reasons, including the possibility that the different threads are updating different fields of the shared memory block aggregated under a single lock variable, or the store operations in the critical section are conditional and frequently do not require actual conflicting store operations.

In such cases, the steps of acquiring and releasing the lock are unnecessary and can be elided. The critical section can be speculatively executed, assuming there will be no conflict, and in those cases where an actual conflict does occur, the conflict can be detected automatically by existing cache protocol methods and execution of the critical section can be re-performed.

Specifically then, the present invention provides a method of coordinating access to common memory by multiple program threads. Each given program thread first detects the beginning of a critical section of the given program thread in which conflicts to access of the common memory could occur resulting from execution of other program threads. The given thread then speculatively executes the critical section. The speculative execution is committed only if there has been no conflict, and is squashed if there has been a conflict.

Thus, it is one object of the invention to allow parallel execution of critical sections by multiple threads, under the recognition that in many cases, no actual conflict will occur.

The conflict may be another thread writing data that was read by the given program thread in the critical section, or another thread reading or writing data that was written by the given program thread. In one embodiment, this conflict may be determined by invalidation of a cache block holding data of the critical section.

Thus, it is another object of the invention to utilize existing cache protocol mechanisms to provide an indication of whether there has been actual conflict in the execution of the critical section.

Often, the critical section will be speculatively executed to its end. The end of the critical section may be detected by examining patterns of instructions typically associated with lock acquisitions. For example, the pattern may be a store instruction directed to an inferred lock variable. In a similar way, the beginning of a critical section may be deduced by a lock acquisition pattern, including atomic read/modify/write instructions.

Thus, it is another object of the invention to infer the existence of a critical section without modification of existing software or compilers. This inference is possible in part because misprediction of a critical section carries with it very little penalty as will be discussed below.

In certain cases, the speculative execution will conclude at a “resource boundary” placing physical limits on the ability to speculate for long critical sections. For example, resource boundaries may be limits in the cache size used for the speculation or the write buffer size, as will be described below, or other resources needed for speculative execution. In such cases, where there is no actual conflict but simply a limitation of resources, the lock variable may be acquired by the given thread and the speculative execution committed, and the given thread may then continue execution from the point at which the speculation was committed to the conclusion of the critical section.

Thus, it is another object of the invention to provide for the efficient execution of arbitrarily long critical sections despite limited resources.

The first step of detecting the critical section may include reading of a lock variable and performing the second step of speculative execution only if the lock variable is not held by another program thread.

Thus, it is another object of the invention to avoid performance degradation in certain cases where the critical section experiences a high number of actual conflicts. If the lock has been acquired, the assumption may be made that another processor or thread had to acquire the lock because of its inability to perform a method of the present invention.

The first step of detecting the critical section may include reading a prediction table holding historical data indicating past successes in speculatively executing the critical section, and the speculative execution may be performed only when the prediction table indicates a likelihood of successful speculative execution of the critical section above a predetermined threshold value.

Thus, it is another object of the invention to avoid speculation for critical sections that are highly contested during actual execution of the program.

The critical section may begin with a lock acquisition section and may end with a lock release section and the present invention may include the step of eliding the lock acquisition and release.

Thus, it is another object of the invention to eliminate the steps of acquiring and releasing a lock variable when no actual conflict occurs, thus speeding execution of the critical section and allowing other threads to concurrently execute the critical section.

The speculative execution of the critical section may elide write instructions that do not change the value of the memory location being written to.

Thus, it is another object of the invention to permit concurrent execution even in the presence of a true conflict between threads accessing the same location and at least one performing a “silent write”, particularly in the case where cache invalidation procedures are used to detect conflicts.

After squashing the speculative execution of the critical section when there has been a conflict, the critical section may be re-executed a predetermined number of times or until there is no conflict. If there remains a conflict after the repeated re-executions, the lock variable may be acquired.
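The retry policy just described can be sketched as a small control loop. This is an illustrative model only: the function name `run_critical_section`, its callback parameters, and the choice to count the initial attempt toward the limit are assumptions, not details taken from the specification.

```python
def run_critical_section(speculate, run_with_lock, retry_limit=3):
    """Attempt speculation up to retry_limit times, then take the lock.

    speculate() is assumed to return True when the attempt commits and
    False when it is squashed because of a conflict.
    run_with_lock() is assumed to execute the section under the real lock.
    Returns a (path, attempts) pair describing what happened.
    """
    for attempt in range(1, retry_limit + 1):
        if speculate():
            return ("speculative", attempt)
    # Every speculative attempt was squashed: fall back to the lock.
    run_with_lock()
    return ("locked", retry_limit)


# Usage: two squashed attempts followed by a successful one.
outcomes = iter([False, False, True])
result = run_critical_section(lambda: next(outcomes), lambda: None)
print(result)  # ('speculative', 3)
```

Raising or lowering `retry_limit` is one way to adjust the degree of speculation to how often conflicts are observed in practice.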

Thus, it is another object of the invention to allow adjustment of the degree of speculation depending on empirical factors that may be determined.

The speculative execution of the critical section may use a cache memory to record the speculative execution without visibility to other processing units.

Thus, it is another object of the invention to provide a simple, speculative mechanism utilizing the cache structures available in many computer architectures.

The foregoing objects and advantages may not apply to all embodiments of the inventions and are not intended to define the scope of the invention, for which purpose claims are provided. In the following description, reference is made to the accompanying drawings, which form a part hereof, and in which there is shown by way of illustration, a preferred embodiment of the invention. Such embodiment also does not define the scope of the invention and reference must be made therefore to the claims for this purpose.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of the multi-processor system showing processors with their associated caches and cache controllers and the lock elision circuit of the present invention, communicating over a network with a common shared memory;

FIG. 2 is a schematic representation of a critical section of a thread executable on a processor of FIG. 1, the critical section having a preceding acquire lock section and a succeeding release lock section and showing example machine instructions to implement the same;

FIG. 3 is a diagrammatic representation of the serialization of multiple threads caused by contention for a lock for a common critical section associated with a block of shared memory;

FIG. 4 is a figure similar to that of FIG. 3 showing parallelization of the same critical sections under the present invention; and

FIG. 5 is a flow chart showing the functions executed by the lock elision circuit of FIG. 1 in implementing the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring now to FIG. 1, a multiprocessor, shared memory computer 10 suitable for use with the present invention includes a number of processor units 12 connected via a bus structure 14 to a common, shared memory 17. The shared memory 17 is depicted logically as a single device, but in fact will often be distributed among the processor units 12, according to methods well known in the art.

Processor units 12 include processor 16 communicating with an L1 cache 18, an L2 cache 20, and a cache controller 22 as is well understood in the art. The shared memory 17 includes a memory controller 19 executing standard cache protocols to allow copying of shared data structure 25 within the shared memory to various ones of the L2 caches 20 of particular processor units 12. The processor unit 12 may be granted “owner” status for writing to memory or “sharing” status allowing for reading of the memory. Change of status of the caches 20, for example, when another cache 20 seeks ownership or sharing of the shared data structure 25, may be accomplished by transmission of the request to the then currently owning or sharing caches 20, invalidating their contents according to protocols well known in the art. Coherence of the caches may be implemented with any of a variety of different cache control protocols including generally “snooping” protocols and those employing directories, as known in the art, and the structure of the bus 14 may be varied accordingly.

The processor units 12 also include the lock elision circuit 24 of the present invention whose operation will be described below.

In a multithreaded program, each processor unit 12 may execute a different thread in parallel. The following description of the present invention will be with respect to such a multiprocessor system. Nevertheless, it will be understood that such multithreaded programs can also be executed on a single processor providing multi-threading capability and the present invention is equally applicable to such systems.

Referring now to FIG. 2, a program thread 26 of a multithreaded program may include a critical section 28 where access to shared data structure 25 occurs and conflicts by other threads 26 are possible. Accordingly, the critical section 28 may be preceded by an acquire lock section 30 in which a LOCK variable (not shown but typically part of the shared data structure 25) is acquired. By convention other threads 26 may not access (read or write) data of shared data structure 25 (other than the LOCK variable) while the LOCK variable is held by another thread 26. A corresponding release lock section 32 follows the critical section 28 to allow release of the LOCK variable and access to the shared data structure 25 again by other threads 26.

Referring now to FIG. 3, in the prior art, during a multi-threaded execution of, for example, four threads 26 a through 26 d, the critical sections 28 a through 28 d of the four threads 26 a through 26 d may all access shared data structure 25 associated with a given LOCK variable. As depicted, if thread 26 a is first to acquire the LOCK variable in preparation for the execution of its critical section 28 a, all other threads 26 b through 26 d break out of their parallel execution and are serialized while waiting for the LOCK variable to be released from the thread 26 ahead of them. Thus, for example, thread 26 b arriving at the acquire lock section 30 shortly after the acquisition of the LOCK by thread 26 a, must wait until the release lock section 32 of critical section 28 a before initiating execution of critical section 28 b. During this waiting time, the thread 26 b “spins” as indicated by the dotted line during which execution stalls. As may be seen, the last thread 26 d may be required to spin for up to three times the length of execution of the critical section 28 before being able to acquire the LOCK variable. In more complex programs with multiple critical sections 28, or threads repeating execution of critical sections 28, the wait can be arbitrarily longer.
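The worst-case spin time described above follows from simple arithmetic. The sketch below assumes, for illustration only, that all threads reach the lock at the same moment and that every critical section takes the same time; the function name `serialized_waits` is hypothetical.

```python
def serialized_waits(n_threads, section_time):
    """Spin time of each thread when critical sections run one at a time.

    Assumes all threads arrive together and each section takes
    section_time, so the i-th thread to acquire the lock waits for the
    i sections ahead of it.
    """
    return [i * section_time for i in range(n_threads)]


waits = serialized_waits(n_threads=4, section_time=10)
print(waits)  # [0, 10, 20, 30]: the last thread spins for three sections
```

With four threads, the fourth waits for three full critical sections, matching the "up to three times" figure given for thread 26 d.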

Referring again to FIG. 2, entry into the critical section 28 may be inferred by observing a pattern of instructions that are typically used for acquiring and releasing a LOCK variable in the acquire lock section 30 and the release lock section 32. For example, the acquire lock section 30 may follow a pattern of atomic read/modify/write instructions for loading the lock variable, testing the lock variable and storing the lock variable, indicated in FIG. 2 by pseudo code 40.

The term “atomic” as used herein refers to an instruction that cannot be interrupted by another thread before completion, or cannot be interrupted before completion without detection. Typically, atomic read/modify/write instructions are readily distinguished from standard STORE and LOAD instructions, and as used herein may include the well known TEST&SET instructions, or as shown, the LOAD LOCK/STORE CONDITIONAL instructions or other equivalent atomic instruction.

These atomic read/modify/write instructions provide some indication of the acquisition of a lock. This indication can be reinforced by a RELEASE sequence having a store instruction directed to the same address as the atomic read/modify/write instructions of the ACQUISITION sequence, both indicated by pseudo code 42.

Thus patterns of instructions with common addresses can be used to infer the acquire lock section 30 and release lock section 32 and thus the location of a critical section 28. It is important to note that this inferential detection of the start and end of a critical section 28 is practical because perfect identification of critical sections 28 is not essential for operation of the invention. If a non-critical section is erroneously identified as a critical section, so long as there is no conflict during its speculative execution, commitment of the speculative execution may still occur without harm. On the other hand, if a critical section is not identified as such, it will simply execute normally.

In situations where an inferred critical section 28 proves at some point during its execution not to have been a critical section, for example, as suggested by a write to a supposed LOCK variable that does not restore the LOCK variable to its pre-critical section “release” value, the preceding speculative execution may simply be committed and the write performed, so long as there has been no conflict. In this respect, lock acquisitions that do not use a single lock release value, for example, those that may release a LOCK variable with any nonzero value, including processor identification values, may still be accommodated by the present invention.

In an alternative embodiment, the invention contemplates the start (and/or end) of the critical section may be identified by one or more special delimiter instructions only used for critical sections. In this case the inference of the beginning of the critical section rises to the level of certainty, but changes in programming practices are required for such a system, unlike that of the preferred embodiment described above.

Referring still to FIG. 2, actual machine code 44 of the acquire lock section 30 may provide further clues to identifying the beginning of the critical section 28. Instructions i(1)-i(7) show an atomic read/modify/write sequence pattern used in the acquisition of a LOCK variable, and in particular, an instruction sequence that uses a specialized LOAD LOCK (ldl_l) instruction i(3) and the STORE CONDITIONAL (stl_c) instruction i(6) which provide quasi-atomic execution and thus are frequently associated with the acquisition of a LOCK variable.

In this sequence, generally instructions i(1) and i(2) load the LOCK variable and test it to see if it is available and if not branch to instruction i(1). Instructions i(3) and i(4) execute only if the LOCK variable is not held as tested by instructions i(1) and i(2). These instructions i(3) and i(4) load the LOCK variable conditionally, meaning that other attempted loads of this variable will be detected at the subsequent store conditional instructions i(6).

If the LOCK variable is not held, instructions i(5), i(6) and i(7) are executed causing a conditional store of a “held” value into the LOCK variable. Instruction i(7) tests to see if the STORE CONDITIONAL instruction was successful, and if not causes a repeat of the operations starting at instruction i(1) as true atomicity of instructions i(1)-i(7) was not obtained.

After the critical section 28, instruction i(16) executes the release of the LOCK variable via a store of the “release” value to the same address.

Referring also to FIG. 1, the lock elision circuit 24 may provide a filter detecting this or a similar pattern to determine the beginning of a critical section 28. In the preferred embodiment, the pattern is a LOAD LOCK instruction followed within a predetermined number of instructions by a STORE CONDITIONAL instruction referencing the same address.

The lock elision circuit 24 identifies the release lock section 32 and hence the end of the critical section 28 by the next STORE instruction to the same address.
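The pattern filter described in the two preceding paragraphs can be modeled in software as a scan over an instruction stream. The following is a sketch under stated assumptions: instructions are reduced to hypothetical (opcode, address) pairs, the opcode names `LOAD_LOCK`, `STORE_COND`, and `STORE` are placeholders rather than real mnemonics, and the `window` parameter stands in for the "predetermined number of instructions". The actual circuit would perform this in hardware.

```python
from collections import namedtuple

# Hypothetical, simplified instruction record: opcode plus target address.
Instr = namedtuple("Instr", ["op", "addr"])


def find_critical_sections(stream, window=4):
    """Return (load_lock, store_cond, release) index triples.

    A section is inferred when a LOAD_LOCK is followed within `window`
    instructions by a STORE_COND to the same address; its end is the
    next plain STORE to that address.
    """
    sections = []
    i = 0
    while i < len(stream):
        ins = stream[i]
        if ins.op == "LOAD_LOCK":
            sc = None
            for j in range(i + 1, min(i + 1 + window, len(stream))):
                if stream[j].op == "STORE_COND" and stream[j].addr == ins.addr:
                    sc = j
                    break
            if sc is not None:
                for k in range(sc + 1, len(stream)):
                    if stream[k].op == "STORE" and stream[k].addr == ins.addr:
                        sections.append((i, sc, k))
                        i = k  # resume scanning after the release
                        break
        i += 1
    return sections


LOCK, DATA = 0x100, 0x200
program = [
    Instr("LOAD", LOCK), Instr("BRANCH", None),
    Instr("LOAD_LOCK", LOCK), Instr("BRANCH", None), Instr("MOVE", None),
    Instr("STORE_COND", LOCK), Instr("BRANCH", None),
    Instr("LOAD", DATA), Instr("STORE", DATA),   # critical section body
    Instr("STORE", LOCK),                        # inferred lock release
]
print(find_critical_sections(program))  # [(2, 5, 9)]
```

Note that the store to `DATA` inside the section is not mistaken for the release, because only a store to the address used by the acquire pattern qualifies.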

The lock elision circuit 24 may include a table (not shown) linking by program counter, a prediction value that a particular instruction is the beginning or end of a critical section 28, and this prediction value may be modified by historical success in the prediction (indicated by a lack of squashing of the speculative execution of the critical section 28) as will be described below. This prediction as to whether a critical section has been found may be supplemented by a prediction as to whether speculative execution of the critical section will be successful, as will be described below.

Methods of inferring the beginning of a critical section are also discussed in co-pending patent application Ser. No. 09/693,030 filed Oct. 20, 2000 entitled “Method of Using Delays to Speed Processing of Inferred Critical Program Portions” assigned to the same assignee as the present application and hereby incorporated by reference.

Referring now to FIG. 4, generally, the present invention uses this ability to infer the beginning and end of a critical section 28 of a thread 26, to change execution modes to execute the critical section 28 speculatively until its end. If at the end of the speculative execution, no actual conflict with another thread 26 has occurred, the speculative execution is committed. In this way, the present invention allows the critical sections 28 of multiple ones of the four program threads 26 a through 26 d to run concurrently provided there is no actual conflict in the dynamic execution, even though they access the same shared data structure 25 which is subject to the same lock. For example, during execution of its critical section 28, thread 26 a may access a first block within shared data structure 25 while thread 26 b accesses a second block within the same shared data structure 25. There is no actual conflict in such accesses although this fact may be undetectable statically.

As a second example, thread 26 c executing the critical section 28 may have a STORE that may be conditionally executed to access the same block as accessed by thread 26 a, yet dynamically this conditional store may not be performed. In this case, again, there is no conflict; however, a conflict would be assumed from static inspection of the threads.

Alternatively, execution of thread 26 d, which in this example writes to the same block as thread 26 b, is delayed by means of its initial speculative execution (indicated by 26 d′) being squashed; however, this delay is much reduced over that obtained in the example of FIG. 3.

Referring now to FIG. 5, the initiation and management of the speculative execution is controlled by the lock elision circuit 24 (shown in FIG. 1). As each instruction is received for execution by the processor 16, the lock elision circuit detects, as indicated by decision block 60, whether an acquire lock section 30 is likely being implemented. This can be done by applying a filter to the instruction buffer to look for the patterns described above. This process will typically be done in hardware and in parallel with standard execution of the instructions. When decision block 60 detects a lock acquire section, standard execution is modified as will be described below.

If the instructions suggest that no LOCK variable is being acquired, the lock elision circuit 24 loops back while allowing standard execution of the instructions.

If on the other hand, the instructions suggest that a lock acquisition is being undertaken, the lock elision circuit 24 proceeds to decision block 64 and the lock variable is read to see if the LOCK variable is in the held state.

If the LOCK variable is held, the lock elision circuit 24 again loops back, allowing standard execution which will continue with the execution of instructions i(2) through i(16) as written (as shown in FIG. 2).

In an alternative embodiment, at process block 64, the prediction table forming part of the lock elision circuit 24 may be consulted to see if previous attempts at speculative execution of the critical section 28 have been successful. The prediction table in this case may store the results of the last N attempts at speculation, for example, indexed by program counter value for fast reference, and the lock elision circuit can defer to standard execution if a certain percentage of the last N speculations were not successful.
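A prediction table of this kind can be sketched as a map from program counter to a short history of outcomes. The class below is an illustrative software model only; the name `ElisionPredictor`, the default values, the choice to speculate when no history exists, and the use of "at least" rather than "strictly above" the threshold are all assumptions.

```python
from collections import defaultdict, deque


class ElisionPredictor:
    """Keeps the last N speculation outcomes per program counter."""

    def __init__(self, history_len=4, min_success_rate=0.5):
        self.min_success_rate = min_success_rate
        # deque(maxlen=N) silently drops the oldest outcome when full.
        self.history = defaultdict(lambda: deque(maxlen=history_len))

    def record(self, pc, success):
        """Log whether speculation at this program counter committed."""
        self.history[pc].append(bool(success))

    def should_speculate(self, pc):
        past = self.history.get(pc)
        if not past:
            return True  # assumed default: be optimistic with no history
        return sum(past) / len(past) >= self.min_success_rate


predictor = ElisionPredictor(history_len=4, min_success_rate=0.5)
for _ in range(4):
    predictor.record(0x4000, success=False)
print(predictor.should_speculate(0x4000))  # False: defer to the lock
```

Because only the last N outcomes are kept, a critical section that stops being contested regains eligibility for speculation after a few successful attempts are recorded.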

If the LOCK variable is not held, as indicated by decision block 64, the lock elision circuit 24 proceeds to process block 65 and elides the acquire lock section 30, being instructions i(2)-i(7). The STORE of instruction i(6) may be suppressed because if speculative execution of the remainder of the critical section is successful, it will be undone by the STORE instruction i(16).

The lock elision circuit 24 then proceeds to process block 66 to begin execution of the critical section 28 starting after instruction i(7) is executed. At this time, the shared data structure 25 necessary for the critical section 28 will be loaded into cache L2 including typically the LOCK variable as was accessed by instruction i(1) and other data needed by the critical section 28. On the other hand, stores by the critical section 28 may be done to the L1 cache 18, which serves as a buffer for the speculative execution of the critical section 28 now being performed, and prevents the effects of the instructions of the critical section from being observed by other processor units 12.

At any time during the execution of the critical section 28, a mis-speculation may occur as detected by process block 68. Such a mis-speculation occurs, as described in part above, if data read by the current thread 26 in the critical section 28 is written to by another thread 26, or if data written to by the current thread 26 in the critical section 28 is read or written to by another thread 26, either of which would also cause invalidation of cache L2. Thus, standard cache protocol messages may be used to detect such a conflict.
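The two conflict conditions above can be modeled by tracking which cache blocks the speculating thread has read and written. This is a software sketch only: the class name `SpeculationTracker`, the `remote_request` method standing in for an observed coherence message, and the 64-byte block size are assumptions chosen for illustration.

```python
BLOCK_SIZE = 64  # assumed cache-block size in bytes


def block_of(addr):
    """Map a byte address to the cache block that contains it."""
    return addr // BLOCK_SIZE


class SpeculationTracker:
    """Tracks blocks touched during one speculative critical section."""

    def __init__(self):
        self.read_blocks = set()
        self.write_blocks = set()
        self.conflict = False

    def local_load(self, addr):
        self.read_blocks.add(block_of(addr))

    def local_store(self, addr):
        self.write_blocks.add(block_of(addr))

    def remote_request(self, addr, is_write):
        """Model a request for this block seen from another processor."""
        blk = block_of(addr)
        if blk in self.write_blocks:
            # Another thread reads or writes data we have written.
            self.conflict = True
        elif is_write and blk in self.read_blocks:
            # Another thread writes data we have read.
            self.conflict = True
        return self.conflict


tracker = SpeculationTracker()
tracker.local_load(0x100)
print(tracker.remote_request(0x104, is_write=False))  # False: shared read
print(tracker.remote_request(0x100, is_write=True))   # True: squash needed
```

In this model, detection works at block granularity, so two threads touching different variables in the same block would also register as a conflict; whether that matters in practice depends on the data layout.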

Speculation per process block 66 continues until one of three conditions is detected by the following three decision blocks 68, 76, and 80.

The first condition may be caused by the occurrence of a conflict such as produces mis-speculation. This terminates the current speculative execution of the critical section 28 causing the lock elision circuit 24 to squash the speculative execution (as indicated by process block 70) by flushing the L1 cache 18 and restoring the program counter of the processor 16 to the beginning of the critical section 28 detected at decision block 60.

Following this squashing, if at decision block 72, a retry limit has not been exceeded, the lock elision circuit 24 proceeds back to decision block 60 to begin speculative execution of the critical section 28 again after detecting the acquire lock section 30.

If the retry limit has been exceeded as checked at decision block 72, indicating that a certain number of retries has been performed without successful speculative execution of the critical section 28, the lock elision circuit 24 branches to decision block 60 and a write to the LOCK variable is completed per instructions i(1) through i(7) in standard execution.

If at decision block 68, no mis-speculation has occurred, the lock elision circuit 24 checks at decision block 76 whether speculation resources have been exhausted. These resource boundaries may vary depending on the particular architecture of the computer 10 and its speculation mechanism, but generally include exhaustion of the L1 cache 18 when used for speculation, or if a register checkpoint mechanism is used, as is well known for speculation, the cache 20 used to store the register checkpoints for squashing has been exhausted, or in those architectures in which a reorder buffer is used for recovery of branch mis-speculation, that buffer is exhausted.

In these situations where a resource boundary has been reached, but there has been no conflict, squashing is not required. At process block 74, an acquisition of the lock may be performed and the lock elision circuit 24 may proceed with execution from the point where it stopped, the resources being made free by committing the speculation up to that point. If the lock cannot be acquired, the speculative execution is squashed as has been described.

A variation of the occurrence of a resource boundary, that is treated in the same way, is the occurrence of a non-cacheable operation, such as a write to an input/output (I/O) location. I/O differs from cacheable memory in that, for example, multiple writes of the same value to I/O may not necessarily be ignored. Decision block 76 may also detect such non-cacheable operations.

At process block 80, the lock elision circuit 24 detects whether a release lock section 32 has occurred, being a STORE instruction using the same address detected in the acquire lock section 30 detected at decision block 60. If a lock release has occurred, the lock elision circuit 24 proceeds to process block 82 and the STORE instruction i(16) is elided as the LOCK variable is already released because of the elision of instruction i(6) at process block 65.

It will be recognized that if the critical section inferred by decision block 60 is not truly a critical section 28, the misidentified STORE instructions may still be elided without harm as it can be guaranteed that no intervening LOAD instructions by any thread have occurred when speculation is successful.

At process block 84, succeeding process block 82, the speculative execution is then committed by updating cache L2 with the contents of the L1 cache 18.
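The buffering role played by the L1 cache 18, with commit updating the visible state and squash discarding it, can be modeled with a small write buffer. This is a simplified sketch: the class name `SpeculativeBuffer` is hypothetical, a Python dictionary stands in for the memory visible to other processor units, and unwritten locations are assumed to read as zero.

```python
class SpeculativeBuffer:
    """Holds speculative stores until they are committed or squashed."""

    def __init__(self, visible_memory):
        self.visible = visible_memory  # state other processors can see
        self.pending = {}              # speculative stores, kept private

    def store(self, addr, value):
        self.pending[addr] = value

    def load(self, addr):
        # The speculating thread sees its own pending stores first.
        if addr in self.pending:
            return self.pending[addr]
        return self.visible.get(addr, 0)

    def commit(self):
        """Publish all pending stores, then empty the buffer."""
        self.visible.update(self.pending)
        self.pending.clear()

    def squash(self):
        """Discard all pending stores; visible memory is untouched."""
        self.pending.clear()


memory = {0x10: 5}
buf = SpeculativeBuffer(memory)
buf.store(0x10, 6)
print(memory[0x10], buf.load(0x10))  # 5 6: the store is not yet visible
buf.squash()
print(buf.load(0x10))                # 5: the speculative value is gone
```

Until `commit` is called, nothing the speculating thread writes can be observed elsewhere, which is what allows a squash to be performed simply by discarding the buffer.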

Referring again to FIG. 5, in a further embodiment of the presentinvention, the execution of STORE instructions within the criticalsection 28 may be examined to see if they are “silent stores”, that is,stores that do not change the value of the memory location to which thestore is directed. In so far as the speculation assumes for its successthat no other threads 26 access the shared data structure 25, theseSTORE instructions may be suppressed. Detection of silent storesrequires only that each STORE instruction within the critical section 28be reviewed to see if it would change the value at the target address.If not, the STORE instruction is elided.

This detection of silent stores allows parallel execution of critical sections even when there are, technically, true conflicts, that is, STORE instructions by different threads to the same address. By suppressing the silent STORE instructions, the threads do not create a write-event to the shared data structure 25 such as would cause a mis-speculation in the given thread 26 operating in the critical section 28.
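The silent-store check can be illustrated with a short sketch that compares each store against the value currently visible at its target address and keeps only those stores that would change it. The function name `filter_silent_stores` and the dictionary model of memory are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def filter_silent_stores(stores: List[Tuple[int, int]],
                         memory: Dict[int, int]) -> List[Tuple[int, int]]:
    """Return only the (address, value) stores that change a value.

    A store is "silent" when the target address already holds the value
    being written, taking earlier stores in the same sequence into account.
    """
    view = dict(memory)  # working copy; the caller's memory is not modified
    kept = []
    for address, value in stores:
        if address in view and view[address] == value:
            continue  # silent store: suppress it, so no write event occurs
        view[address] = value
        kept.append((address, value))
    return kept
```

For example, with memory `{1: 5, 2: 7}`, the stores `(1, 5)`, `(2, 8)`, `(2, 8)`, `(3, 0)` reduce to `(2, 8)` and `(3, 0)`: the first store rewrites the existing value and the third repeats the second.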

It will be recognized that the above described invention may be used for nested critical sections 28 simply by buffering the states of the variables required by the flow chart of FIG. 5. No memory ordering problems exist because the speculative execution of the critical section has the appearance of atomicity when the data accessed by the critical section has not been accessed by any other thread.

As will be understood from the above description, the present invention is applicable to a wide range of different computer architectures and should not be construed to be limited to the particular architecture described herein. The speculative execution of the critical section may employ other speculation mechanisms, including those employing “register checkpoints” or “reorder buffers”, all well known in the art. It is specifically intended that the present invention not be limited to the embodiments and illustrations contained herein, but that modified forms of those embodiments including portions of the embodiments and combinations of elements of different embodiments also be included as come within the scope of the following claims.

We claim:
 1. A method of coordinating access to common memory by multiple program threads comprising the steps of: in each given program thread, (a) detecting the beginning of a critical section of the given program thread in which conflicts to access of the common memory could occur resulting from execution of other program threads; (b) speculatively executing the critical section; and (c) committing the speculative execution of the critical section if there has been no conflict and squashing the speculative execution of the critical section if there has been a conflict.
 2. The method of claim 1, wherein the conflict is: (a) another thread writing data read by the given program thread in the critical section, or (b) another thread reading or writing data written by the given program thread.
 3. The method of claim 2 wherein the conflict is detected by an invalidation of a cache block holding data of the critical section.
 4. The method of claim 1 wherein the speculative execution is committed at the end of the critical section.
 5. The method of claim 4 wherein the end of the critical section is detected by a pattern of instructions typically associated with a lock release.
 6. The method of claim 5 wherein the pattern of instructions is a store instruction to a deduced lock variable address.
 7. The method of claim 1 wherein the speculative execution is committed at a resource boundary limiting further speculation.
 8. The method of claim 7 including the step of: (d) if at step (c) there was no conflict from the execution of another thread, acquiring a lock variable allowing the given thread to have exclusive access to the critical section and continuing execution from the commitment point to the conclusion of the critical section.
 9. The method of claim 1 wherein the speculative execution is committed upon the occurrence of a non-cacheable operation limiting further speculation.
 10. The method of claim 9 including the step of: (d) if at step (c) there was no conflict from the execution of another thread, acquiring a lock variable allowing the given thread to have exclusive access to the critical section and continuing execution from the commitment point to the conclusion of the critical section.
 11. The method of claim 1 wherein step (a) includes reading of a lock variable and wherein step (b) is performed only when the lock variable is not held by another program thread.
 12. The method of claim 1 wherein step (a) includes reading a prediction table holding historical data indicating past successes in speculatively executing the critical section and wherein step (b) is performed only when the prediction table indicates a likelihood of successful speculative execution of the critical section of above a predetermined threshold.
 13. The method of claim 1 wherein step (a) deduces the beginning of a critical section by detecting patterns of instructions typically associated with a lock acquisition.
 14. The method of claim 13 wherein the pattern includes an atomic read/modify/write sequence.
 15. The method of claim 1 wherein the critical section is preceded by a lock acquisition section and including the step of eliding the lock acquisition before step (b).
 16. The method of claim 1 wherein the critical section ends with a lock release section and including the step of eliding the lock release section after step (c) when at step (c) upon reaching the end of the critical section, no conflict from the execution of another thread occurred.
 17. The method of claim 1 including the further step of: (d) after squashing the speculative execution of the critical section if there has been a conflict, re-executing the critical section speculatively.
 18. The method of claim 17 wherein the speculative re-execution of the critical section is repeated up to a predetermined number of times until there is not a conflict.
 19. The method of claim 17 wherein (d) if after the predetermined number of tries there remains a conflict from the execution of another thread, acquiring a lock variable allowing the given thread to have exclusive access to the critical section and continuing execution of the critical section from its beginning.
 20. The method of claim 1 wherein the speculation executes the critical section using a cache memory to record the speculative execution without visibility to other processing units.
 21. The method of claim 1 wherein the speculation executes the critical section eliding write instructions that do not change a value of memory location being written to.
 22. A lock elision circuit for a computer architecture allowing the access of common memory by multiple program threads, the circuit comprising: means for controlling the execution of each given program thread to: (a) detect the beginning of a critical section of the given program thread in which conflicts to access of the common memory could occur resulting from execution of other program threads; (b) speculatively execute the critical section; and (c) commit the speculative execution of the critical section if there has been no conflict and squash the speculative execution of the critical section if there has been a conflict.
 23. The lock elision circuit of claim 22 wherein the conflict is: (a) another thread writing data read by the given program thread in the critical section, or (b) another thread reading or writing data written by the given program thread.
 24. The lock elision circuit of claim 23 wherein the computer architecture includes a cache and the conflict is detected by an invalidation of a cache block holding data of the critical section.
 25. The lock elision circuit of claim 22 wherein the speculative execution is committed at the end of the critical section.
 26. The lock elision circuit of claim 25 wherein the end of the critical section is detected by a pattern of instructions typically associated with a lock release.
 27. The lock elision circuit of claim 26 wherein the pattern of instructions is a store instruction to a deduced lock variable address.
 28. The lock elision circuit of claim 22 wherein the speculative execution is committed at a resource boundary limiting further speculation.
 29. The lock elision circuit of claim 28 wherein when there is no conflict from the execution of another thread acquiring a lock variable, the lock elision circuit allows the given thread to have exclusive access to the critical section and continues execution from the commitment point to the conclusion of the critical section.
 30. The lock elision circuit of claim 22 wherein the lock elision circuit reads a lock variable and speculatively executes the critical section only when the lock variable is not held by another program thread.
 31. The lock elision circuit of claim 22 wherein the speculative execution is committed upon the occurrence of a non-cacheable operation limiting further speculation.
 32. The lock elision circuit of claim 31 including the step of: (d) if at step (c) there was no conflict from the execution of another thread, acquiring a lock variable allowing the given thread to have exclusive access to the critical section and continuing execution from the commitment point to the conclusion of the critical section.
 33. The lock elision circuit of claim 22 including a prediction table holding historical data indicating past successes in speculatively executing the critical section and wherein the lock elision circuit speculatively executes the critical section only when the prediction table indicates a likelihood of successful speculative execution of the critical section of above a predetermined threshold.
 34. The lock elision circuit of claim 22 wherein the lock elision circuit determines the beginning of a critical section by detecting patterns of instructions typically associated with a lock acquisition.
 35. The lock elision circuit of claim 34 wherein the pattern includes an atomic read/modify/write sequence.
 36. The lock elision circuit of claim 22 wherein the critical section is preceded by a lock acquisition section and wherein the lock elision circuit elides the lock acquisition before speculation.
 37. The lock elision circuit of claim 22 wherein the critical section ends with a lock release section and wherein the lock elision circuit elides the lock release section after speculation when upon reaching the end of the critical section, no conflict from the execution of another thread occurred.
 38. The lock elision circuit of claim 22 wherein after squashing the speculative execution of the critical section, if there has been a conflict, the lock elision circuit re-executes the critical section speculatively.
 39. The lock elision circuit of claim 38 wherein the lock elision circuit repeats the speculative re-execution of the critical section up to a predetermined number of times until there is not a conflict.
 40. The lock elision circuit of claim 39 wherein if after the predetermined number of tries there remains a conflict from the execution of another thread, the lock elision circuit allows acquisition of a lock variable allowing the given thread to have exclusive access to the critical section and continuing execution of the critical section from its beginning.
 41. The lock elision circuit of claim 22 wherein the computer architecture includes a cache memory and the lock elision circuit uses the cache memory to record the speculative execution without visibility to other processing units.
 42. The lock elision circuit of claim 22 wherein the lock elision circuit elides write instructions within the critical section that do not change a value of memory location being written to.